Abstract. We propose to model the text classification process as a sequential
decision process. In this process, an agent learns to classify documents into
topics while reading the document sentences sequentially, and learns to stop as
soon as enough information has been read to decide. The proposed algorithm is
based on a formulation of Text Classification as a Markov Decision Process and
learns by using Reinforcement Learning. Experiments on four different classical
mono-label corpora show that the proposed approach performs comparably to
classical SVM approaches for large training sets, and better for small training
sets. In addition, the model automatically adapts its reading process to the
quantity of training information provided.

1 Introduction



Text Classification (TC) is the act of taking a set of labeled text documents,
learning a correlation between a document's contents and its corresponding
labels, and then predicting the labels of a set of unlabeled test documents as best
as possible. TC has been studied extensively, and is one of the older specialties
of Information Retrieval. Classical statistical TC approaches are based on
well-known machine learning models such as generative models (Naive Bayes, for
example [1][2]) or discriminant models such as Support Vector Machines [3].
They mainly consider the bag of words representation of a document (where the
order of the words or sentences is lost) and try to compute a category score by
looking at the entire document content. Linear SVMs in particular, especially
for multi-label classification with many binary SVMs, have been shown to
work particularly well [4]. Some major drawbacks to these global methods have
been identified in the literature:

— These methods take into consideration a document's entire word set in order
to decide to which categories it belongs. The underlying assumption is that
the category information is homogeneously distributed within the document.
This is well suited for corpora where documents are short, with little noise,
so that global word frequencies can easily be correlated to topics. However,
these methods are not well suited to predicting the categories of large
documents where the topic information is concentrated in only a few
sentences.

— Additionally, for these methods to be applicable, the entire document must
be known at the time of classification. In cases where there is a cost associated 
with acquiring the textual information, methods that consider the entire 
document cannot be efficiently or reliably applied as we do not know at 
what point their classification decision is well-informed while considering 
only a subset of the document. 

Considering these drawbacks, some attempts have been made to use the se- 
quential nature of these documents for TC and similar problems such as passage 
classification. The earliest models developed especially for sequence processing 
extend Naive Bayes with Hidden Markov Models. Denoyer et al. [5] propose an 
original model which aims at modeling a document as a sequence of irrelevant 
and relevant sections relative to a particular topic. In [6], the authors propose 
a model based on recurrent Neural Networks for document routing. Other ap- 
proaches have proposed to extend the use of linear SVMs to sequential data, 
mainly through the use of string kernels [7]. Finally, sequential models have 
been used for Information Extraction [8,9], passage classification [10,11], or the 
development of search engines [12,13]. 

We propose a new model for Text Classification that is less affected by the 
aforementioned issues. Our approach models an agent that sequentially reads a 
text document while concurrently deciding to assign topic labels. This is modeled 
as a sequential process whose goal is to classify a document by focusing on its 
relevant sentences. The proposed model learns not only to classify a document 
into one or many classes, but also when to label, and when to stop reading the 
document. This last point is very important because it means that the system is
able to learn to label a document with the correct categories as soon as possible, 
without reading the entire text. 

The contributions of this paper are three-fold: 

1. We propose a new type of sequential model for text classification based on 
the idea of sequentially reading sentences and assigning topics to a document. 

2. Additionally, we propose an algorithm using Reinforcement Learning that 
learns to focus on relevant sentences in the document. This algorithm also 
learns when to stop reading a document so that the document is classified 
as soon as possible. This characteristic can be useful for documents where 
sentence acquisition is expensive, such as large Web documents or conversa- 
tional documents. 

3. We show that on popular text classification corpora our model outperforms 
classical TC methods for small training sets and is equivalent to a baseline 
SVM for larger training sets while only reading a small portion of the doc- 
uments. The model also shows its ability to classify by reading only a few 
sentences when the classification problem is easy (large training sets) and to 
learn to read more sentences when the task is harder (small training sets). 

This document is organized as follows: In Section 2, we present an overview of our
method. We formalize the algorithm as a Markov Decision Process in Section 3
and detail the approach for both multi-label and mono-label TC. We then
present the set of experiments made on four different text corpora in Section 4.

2 Task Definition and General Principles of the Approach

Let $\mathcal{D}$ denote the set of all possible textual documents, and
$\mathcal{Y}$ the set of $C$ categories numbered from 1 to $C$. Each document
$d$ in $\mathcal{D}$ is associated with one or many¹ of the $C$ categories.
This label information is only known for a subset of documents
$\mathcal{D}_{train} \subset \mathcal{D}$ called training documents, composed of
$N_{train}$ documents denoted $\mathcal{D}_{train} = (d_1, \ldots, d_{N_{train}})$.
The labels of document $d_i$ are given by a vector of scores
$y^i = (y^i_1, \ldots, y^i_C)$. We assume that:

$$y^i_k = \begin{cases} 1 & \text{if } d_i \text{ belongs to category } k \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

The goal of TC is to compute, for each document $d$ in $\mathcal{D}$, the
corresponding score for each category. The classification function $f_\theta$
with parameters $\theta$ is thus defined as:

$$f_\theta : d \mapsto y^d \tag{2}$$

Learning the classifier consists in finding an optimal parameterization
$\theta^*$ that minimizes the mean loss such that:

$$\theta^* = \arg\min_{\theta} \frac{1}{N_{train}} \sum_{i=1}^{N_{train}} L(f_\theta(d_i), y^i), \tag{3}$$

where $L$ is a loss function proportional to the classification error of $f_\theta(d_i)$.

2.1 Overview of the approach 

This section aims to provide an intuitive overview of our approach. The ideas 
presented here are formally presented in Section 3, and will only be described in 
a cursory manner below. 

Inference We propose to model the process of text classification as a sequential 
decision process. In this process, our classifier reads a document sentence-by- 
sentence and can decide — at each step of the reading process — if the document 
belongs to one of the possible categories. This classifier can also choose to stop
reading the document once it considers that the document has been correctly 
categorized. 

In the example described in Fig. 1, the task is to classify a document
composed of 4 sentences. The document starts off unclassified, and the classifier
begins by reading the first sentence of the document. Because it considers that 
the first sentence does not contain enough information to reliably classify the 
document, the classifier decides to read the following sentence. Having now read 
the first two sentences, the classifier decides that it has enough information at 
hand to classify the document as cocoa. 

¹ In this article, we consider both the mono-label classification task, where each
document is associated with exactly one category, and the multi-label task, where a
document can be associated with several categories.

Fig. 1. Inference on a document



The classifier now reads the third sentence and — considering the informa- 
tion present in this sentence — decides that the reading process is finished; the 
document is therefore classified in the cocoa category. 

Had the document belonged to multiple classes, the classifier could have 
continued to assign other categories to the document as additional information 
was discovered. 

In this example, the model took four actions: next, classify as cocoa, next and
then stop. The choice of each action was entirely dependent on the corresponding
state of the reading process. The choice of actions given the state, such as those
picked while classifying the example document above, is called the policy of
the classifier. This policy, denoted $\pi$, consists of a mapping of states to
actions relative to a score. This score is called a Q-value, denoted $Q(s, a)$,
and reflects the worth of choosing action $a$ during state $s$ of the process. Using
the Q-value, the inference process can be seen as a greedy process which, for
each timestep, chooses the best action $a^*$ defined as the action with the highest
score w.r.t. $Q(s, a)$:

$$a^* = \arg\max_{a} Q(s, a). \tag{4}$$
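
As a rough illustration of the greedy rule in equation (4), the following Python sketch runs a reading loop on a four-sentence toy document. The names `greedy_inference`, `q_value` and `toy_q` are hypothetical and do not come from the paper. The toy Q-function is hand-written only so that the example runs; it stands in for a learned one. The sketch also assumes at most one label is assigned and that the next action is unavailable on the last sentence.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical state: (sentences read so far, label assigned so far or None).
State = Tuple[Tuple[str, ...], Optional[str]]


def greedy_inference(
    sentences: List[str],
    categories: List[str],
    q_value: Callable[[State, str], float],
) -> Tuple[Optional[str], int]:
    """Return (assigned label or None, number of sentences read)."""
    position = 1  # the first sentence is read before any decision is taken
    label: Optional[str] = None
    while True:
        state: State = (tuple(sentences[:position]), label)
        actions = ["stop"]
        if position < len(sentences):
            actions.append("next")
        if label is None:
            actions += ["classify as " + c for c in categories]
        # Equation (4): pick the action with the highest Q-value.
        best = max(actions, key=lambda a: q_value(state, a))
        if best == "stop":
            return label, position
        if best == "next":
            position += 1
        else:
            label = best[len("classify as "):]


def toy_q(state: State, action: str) -> float:
    """Hand-written stand-in for a learned Q-function (illustrative only)."""
    read, label = state
    if action == "classify as cocoa":
        return 1.0 if any("cocoa" in s for s in read) else -1.0
    if action == "next":
        if label is None:
            return 0.5
        return 0.1 if len(read) < 3 else -0.5
    if action == "stop":
        return 0.0 if label is not None else -0.5
    return -1.0  # any other classification action


doc = ["Prices rose sharply.", "The cocoa harvest was poor.", "Exports fell.", "End."]
label, n_read = greedy_inference(doc, ["cocoa", "wheat"], toy_q)
```

With this toy Q-function the loop takes the same four actions as the example above (next, classify as cocoa, next, stop) and never reads the fourth sentence.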



Training The learning process consists in computing a Q-function² which
minimizes the classification loss (as in equation (3)) of the documents in the
training set. The learning procedure uses a Monte Carlo approach to find a set of
good and bad actions relative to each state. Good actions are actions that result
in a small classification loss for a document. The good and bad actions are then
learned by a statistical classifier, such as an SVM.

An example of the training procedure on the same example document as 
above is illustrated in Fig. 2. To begin with, a random state of the classification
process is picked. Then, for each action possible in that state, the current policy 
is run until it stops and the final classification loss is computed. The training 
algorithm then builds a set of good actions — the actions for which the simu- 
lation obtains the minimum loss value — and a set of remaining bad actions. 



² The Q-function is an approximation of $Q(s, a)$.

This is repeated on many different states and training documents until, at last,
the model learns a classifier able to discriminate between good and bad actions 
relative to the current state. 
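
The rollout step described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function `label_actions_by_rollout` and the toy environment (`toy_actions`, `toy_step`, `stop_immediately`, `toy_loss`) are hypothetical names introduced for illustration, and the "current policy" is a deliberately naive one that always stops.

```python
from typing import Any, Callable, Dict, List, Tuple


def label_actions_by_rollout(
    state: Any,
    actions: Callable[[Any], List[str]],
    step: Callable[[Any, str], Any],
    policy: Callable[[Any], str],
    final_loss: Callable[[Any], float],
) -> Tuple[List[str], List[str]]:
    """Try every action in `state`, follow `policy` to the end, split by loss."""
    losses: Dict[str, float] = {}
    for first_action in actions(state):
        s = step(state, first_action)
        while actions(s):  # a terminal state has no available action
            s = step(s, policy(s))
        losses[first_action] = final_loss(s)
    best = min(losses.values())
    good = [a for a, loss in losses.items() if loss == best]
    bad = [a for a, loss in losses.items() if loss != best]
    return good, bad


# Toy mono-label setting: a state is (position, label, stopped).
N_SENTENCES, TRUE_LABEL = 3, "cocoa"


def toy_actions(s):
    position, label, stopped = s
    if stopped:
        return []
    acts = ["stop"]
    if position < N_SENTENCES:
        acts.append("next")
    if label is None:
        acts += ["classify as cocoa", "classify as wheat"]
    return acts


def toy_step(s, action):
    position, label, _ = s
    if action == "stop":
        return (position, label, True)
    if action == "next":
        return (position + 1, label, False)
    return (position, action[len("classify as "):], False)


def stop_immediately(s):
    return "stop"  # a deliberately naive current policy


def toy_loss(s):
    return 0.0 if s[1] == TRUE_LABEL else 1.0


good, bad = label_actions_by_rollout(
    (1, None, False), toy_actions, toy_step, stop_immediately, toy_loss
)
```

The resulting (state, action) pairs, tagged good or bad, are the training examples that a statistical classifier would then learn from.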

Fig. 2. Learning the sequential model. The different steps of one learning iteration are
illustrated from left to right on a single training document. 



2.2 Preliminaries 

We have presented the principles of our approach and given an intuitive de- 
scription of the inference and learning procedures. We will now formalize this 
algorithm as a Markov Decision Process (MDP) for which an optimal policy is 
found via Reinforcement Learning. Note that we will only go over notations per- 
tinent to our approach, and that this section lacks many MDP or Reinforcement 
Learning definitions that are not necessary for our explanation. 

Markov Decision Process A Markov Decision Process (MDP) is a mathematical
formalism to model sequential decision processes. We only consider
deterministic MDPs, defined by a 4-tuple $(S, A, T, r)$. Here, $S$ is the set of
possible states, $A$ is the set of possible actions, and $T : S \times A \to S$ is
the state transition function such that $T(s, a) \to s'$ (this symbolizes the
system moving from state $s$ to state $s'$ by applying action $a$). The reward,
$r : S \times A \to \mathbb{R}$, is a value that reflects the quality of taking
action $a$ in state $s$ relative to the agent's ultimate goal. We will use
$A(s) \subseteq A$ to refer to the set of possible actions available to an agent
in a particular state $s$.

An agent interacts with the MDP by starting off in a state $s \in S$. The agent
then chooses an action $a \in A(s)$ which moves it to a new state $s'$ by applying
the transition $T(s, a)$. The agent obtains a reward $r(s, a)$ and then continues the
process until it reaches a terminal state $s_{final}$ where the set of possible actions
is empty, i.e. $A(s_{final}) = \emptyset$.

Reinforcement Learning Let us define $\pi : S \to A$, a stochastic policy such
that $\forall a \in A(s)$, $\pi(s) = a$ with probability $P(a|s)$. We consider
here the finite-horizon context, for which the cumulative reward corresponds to
the sum of the rewards obtained at each step by the system when following the
policy $\pi$. The goal of Reinforcement Learning is to find an optimal policy,
denoted $\pi^*$, which maximizes the cumulative reward obtained for all the
states of the process, i.e.:

$$\pi^* = \arg\max_{\pi} \sum_{s_0 \in S} E_{\pi}\left[ \sum_{t=0}^{T} r(s_t, a_t) \right]. \tag{5}$$
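
To make the interaction loop and the inner sum of equation (5) concrete, here is a small Python sketch of one episode in a deterministic MDP. The function `run_episode` and the three-step toy chain are hypothetical and only illustrate the definitions above; the `max_steps` guard is an added safety assumption, not part of the formalism.

```python
from typing import Any, Callable, List, Tuple


def run_episode(
    start: Any,
    actions: Callable[[Any], List[str]],
    transition: Callable[[Any, str], Any],
    reward: Callable[[Any, str], float],
    policy: Callable[[Any, List[str]], str],
    max_steps: int = 1000,
) -> Tuple[float, List[Tuple[Any, str]]]:
    """Follow `policy` until a terminal state and return the cumulative reward."""
    state, total, trajectory = start, 0.0, []
    for _ in range(max_steps):
        available = actions(state)
        if not available:  # terminal state: A(s) is empty
            break
        action = policy(state, available)
        total += reward(state, action)  # accumulates r(s_t, a_t)
        trajectory.append((state, action))
        state = transition(state, action)  # deterministic T(s, a)
    return total, trajectory


# Toy chain 0 -> 1 -> 2 -> 3, where state 3 is terminal and only the
# last move is rewarded.
total, trajectory = run_episode(
    start=0,
    actions=lambda s: ["inc"] if s < 3 else [],
    transition=lambda s, a: s + 1,
    reward=lambda s, a: 1.0 if s == 2 else 0.0,
    policy=lambda s, available: available[0],
)
```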

Many algorithms have been developed for finding such a policy, depending
on the assumptions made on the structure of the MDP, the nature of the states
(discrete or continuous), etc. In many approaches, a policy $\pi$ is defined through
the use of a Q-function which reflects how much reward one can expect by
taking action $a$ in state $s$. With such a function, the policy $\pi$ is defined as:

$$\pi = \arg\max_{a \in A(s)} Q(s, a). \tag{6}$$

In such a case, the learning problem consists in finding the optimal value $Q^*$
which results in the optimal policy $\pi^*$.

Due to the very large number of states we are facing in our approach, we
consider the Approximated Reinforcement Learning context where the Q-function
is approximated by a parameterized function $Q_\theta(s, a)$, where $\theta$ is a
set of parameters such that:

$$Q_\theta(s, a) = \langle \theta, \Phi(s, a) \rangle, \tag{7}$$

where $\langle \cdot, \cdot \rangle$ denotes the dot product and $\Phi(s, a)$ is a
feature vector representing the state-action pair $(s, a)$. The learning problem
consists in finding the optimal parameters $\theta^*$ that result in an optimal policy:

$$\pi^* = \arg\max_{a \in A(s)} \langle \theta^*, \Phi(s, a) \rangle. \tag{8}$$
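
Equations (7) and (8) amount to scoring each available action with a dot product and keeping the best one. The short Python sketch below shows this; the names `q_theta` and `greedy_action` and the numeric values are purely illustrative assumptions.

```python
from typing import Dict, Sequence


def q_theta(theta: Sequence[float], features: Sequence[float]) -> float:
    """Linear Q-value: dot product between parameters and features."""
    return sum(t * f for t, f in zip(theta, features))


def greedy_action(theta: Sequence[float], features_by_action: Dict[str, Sequence[float]]) -> str:
    """Return the action whose feature vector has the highest linear score."""
    return max(features_by_action, key=lambda a: q_theta(theta, features_by_action[a]))


theta = [0.5, -1.0, 2.0]  # made-up parameters
features_by_action = {
    "next": [1.0, 0.0, 0.0],               # score  0.5
    "stop": [0.0, 1.0, 0.0],               # score -1.0
    "classify as cocoa": [0.0, 0.0, 0.5],  # score  1.0
}
best = greedy_action(theta, features_by_action)
```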



3 Text Classification as a Sequential Decision Problem 

Formally, we consider that a document $d$ is composed of a sequence of sentences
such that $d = (\delta^d_1, \ldots, \delta^d_{n^d})$, where $\delta^d_i$ is the i-th
sentence of the document and $n^d$ is the total number of sentences making up the
document. Each sentence $\delta^d_i$ has a corresponding feature vector (a
normalized tf-idf vector in our case) that describes its content.

3.1 MDP for Multi-Label Text Classification

We can describe our sequential decision problem using an MDP. Below, we
describe the MDP for the multi-label classification problem, of which mono-label
classification is just a specific instance:

— Each state $s$ is a triplet $(d, p, \hat{y})$ such that:

• $d$ is the document the agent is currently reading.

• $p \in [1, n^d]$ corresponds to the current sentence being read; this implies
that $\delta^d_1$ to $\delta^d_{p-1}$ have already been read.

• $\hat{y}$ is the set of currently assigned categories (categories previously
assigned by the agent during the current reading process), where $\hat{y}_k = 1$
iff the document has been assigned to category $k$ during the reading
process, and 0 otherwise.

— The set of actions $A(s)$ is composed of:

• One or many classification actions denoted classify as k for each
category $k$ where $\hat{y}_k = 0$. These actions correspond to assigning document $d$
to category $k$.

• A next sentence action denoted next which corresponds to reading the
next sentence of the document.

• A stop action denoted stop which corresponds to finishing the reading
process.

— The transitions $T(s, a)$ act such that:

• $T(s, \text{classify as } k)$ sets $\hat{y}_k \leftarrow 1$.

• $T(s, \text{next})$ sets $p \leftarrow p + 1$.

• $T(s, \text{stop})$ halts the decision process.

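— The reward $r(s, a)$ is defined as:

$$r(s, a) = \begin{cases} F_1(y, \hat{y}) & \text{if } a \text{ is a stop action} \\ 0 & \text{otherwise} \end{cases} \tag{9}$$

where $y$ is the real vector of categories for $d$ and $\hat{y}$ is the predicted vector of
categories at the end of the classification process. The $F_1$ score of a single
document is defined as:

$$F_1(y, \hat{y}) = 2 \cdot \frac{p(y, \hat{y}) \cdot r(y, \hat{y})}{p(y, \hat{y}) + r(y, \hat{y})} \tag{10}$$

with

$$p(y, \hat{y}) = \sum_{k=1}^{C} \mathbb{1}(\hat{y}_k = y_k) \Big/ \sum_{k=1}^{C} \hat{y}_k \tag{11}$$

and

$$r(y, \hat{y}) = \sum_{k=1}^{C} \mathbb{1}(\hat{y}_k = y_k) \Big/ \sum_{k=1}^{C} y_k. \tag{12}$$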
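
The MDP described above can be sketched in Python as below. This is an illustrative sketch under several assumptions that are not in the paper: the names `State`, `available_actions`, `transition`, `f1` and `reward` are hypothetical, the document is reduced to its sentence count, the next action is withheld on the last sentence, and precision and recall use the usual count of correctly assigned categories as their numerator.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class State:
    n_sentences: int                        # total number of sentences in d
    position: int = 1                       # p, the sentence being read
    assigned: FrozenSet[int] = frozenset()  # categories k already assigned
    stopped: bool = False


def available_actions(s: State, n_categories: int) -> List[str]:
    if s.stopped:
        return []  # terminal state
    acts = [f"classify as {k}" for k in range(1, n_categories + 1) if k not in s.assigned]
    if s.position < s.n_sentences:  # assumption: no next on the last sentence
        acts.append("next")
    acts.append("stop")
    return acts


def transition(s: State, action: str) -> State:
    if action == "next":
        return State(s.n_sentences, s.position + 1, s.assigned, False)
    if action == "stop":
        return State(s.n_sentences, s.position, s.assigned, True)
    k = int(action.rsplit(" ", 1)[1])  # "classify as k"
    return State(s.n_sentences, s.position, s.assigned | {k}, False)


def f1(true: FrozenSet[int], predicted: FrozenSet[int]) -> float:
    correct = len(true & predicted)  # assumption: standard true-positive count
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(true)
    return 2 * precision * recall / (precision + recall)


def reward(s: State, action: str, true: FrozenSet[int]) -> float:
    """F1 of the assigned categories on stop, 0 for every other action."""
    return f1(true, s.assigned) if action == "stop" else 0.0


s = State(n_sentences=3)
for a in ["classify as 2", "next", "classify as 5"]:
    s = transition(s, a)
final_reward = reward(s, "stop", frozenset({2, 3}))
```

In this run the agent assigns categories 2 and 5 while the true categories are 2 and 3, so precision and recall are both 0.5 and the reward on stop is 0.5.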



MDP for Mono-Label Text Classification In mono-label classification, we
restrict the set of possible actions. The classify as k action leads to a stopping
state such that $A(s) = \{\text{stop}\}$. This brings the episode to an end after the
attribution of a single label. Note that in the case of a mono-label system, where
only one category can be assigned to a document, the reward corresponds to a
classical accuracy measure: 1 if the chosen category is correct, and 0 otherwise.

3.2 Features over states

We must now define a feature function which provides a vector representation of
a state-action pair (s, a). The purpose of this vector is to be able to present (s, a) 
to a statistical classifier to know whether it is good or bad. Comparing the scores 
of various (s, a) pairs for a given state s allows us to choose the best action for 
that state. 

Classical text classification methods only represent documents by a global — 
and usually tf-idf weighted — vector. We choose, however, to include not only 
a global representation of the sentences read so far, but also a local component 
corresponding to the most recently read sentence. Moreover, while in state s, a 
document may have already been assigned to a set of categories; the global
feature vector $\Phi(s, a)$ must describe all this information. The vector
representation of a state $s$ is thus defined as $\phi(s)$:

$$\phi(s) = \left( \frac{1}{p} \sum_{i=1}^{p} \delta^d_i \quad \delta^d_p \quad \hat{y}_1 \ldots \hat{y}_C \right) \tag{13}$$

$\phi(s)$ is the concatenation of a set of sub-vectors describing: the mean of the
feature vectors of the read sentences, the feature vector of the last sentence, and
the set of already assigned categories.

In order to include the action information, we use the block-vector trick
introduced by [14], which consists in projecting $\phi(s)$ into a higher-dimensional
space such that:

$$\Phi(s, a) = (0 \ldots \phi(s) \ldots 0). \tag{14}$$

The position of $\phi(s)$ inside the global vector $\Phi(s, a)$ is dependent on action $a$.
This results in a very high-dimensional space which is easier to classify in with
a linear model.
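
A minimal Python sketch of the state features and of the block-vector trick is given below. The function names `state_features` and `block_features`, the two-dimensional sentence vectors and the action indexing are assumptions made for illustration; real sentence vectors would be normalized tf-idf vectors.

```python
from typing import List, Sequence


def state_features(read: Sequence[Sequence[float]], assigned: Sequence[int]) -> List[float]:
    """Concatenate: mean of read sentence vectors, last sentence vector, assigned flags."""
    dim = len(read[0])
    mean = [sum(v[j] for v in read) / len(read) for j in range(dim)]
    return mean + list(read[-1]) + [float(y) for y in assigned]


def block_features(phi: Sequence[float], action_index: int, n_actions: int) -> List[float]:
    """Copy phi into the block reserved for the action, zeros everywhere else."""
    out = [0.0] * (len(phi) * n_actions)
    start = action_index * len(phi)
    out[start:start + len(phi)] = phi
    return out


read_sentences = [[1.0, 0.0], [0.0, 1.0]]  # toy sentence vectors
phi = state_features(read_sentences, assigned=[0, 1])
big_phi = block_features(phi, action_index=1, n_actions=3)
```

With a linear model, the dot product between the parameters and such a block vector only involves the slice of parameters aligned with the chosen action, so the trick is equivalent to keeping one weight vector per action.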

3.3 Learning and Finding the optimal classification policy 

In order to find the best classification policy, we used a recent Reinforcement
Learning algorithm called Approximate Policy Iteration with Rollouts. In brief,
this method uses a Monte Carlo approach to evaluate the quality of all the
possible actions from a set of randomly sampled states, and then learns a classifier
whose goal is to discriminate between the good and bad actions relative to each
state. Due to a lack of space, we do not detail the learning procedure here and
refer to the paper by Lagoudakis et al. [15]. An intuitive description of the
procedure is given in Section 2.1.

4 Experimental Results 

We have applied our model to four different popular datasets: three mono-label
and one multi-label. All datasets were pre-processed in the same manner: all
punctuation except for periods was removed, SMART stop-words [16] and words
shorter than three characters were removed, and all words were stemmed with
Porter stemming. Baseline evaluations were performed with libSVM [17] on
normalized tf-idf weighted vectorial representations of each document, as has been
done in [3]. Published performance benchmarks can be found in [18] and [19].
The datasets are:

— The Reuters-21578³ dataset which provides two corpora:

• The Reuters8 corpus, a mono-label corpus composed of the 8 largest
categories.

• The Reuters10 corpus, a multi-label corpus composed of the 10 largest
categories.

— The WebKB⁴ [20] dataset is a mono-label corpus composed of Web pages
dispatched into 4 different categories.

— The 20 Newsgroups⁵ (20NG) dataset is a mono-label corpus of news
composed of 20 classes.



Corpus      Nb of documents   Nb of categories   Avg. nb of sentences per doc.   Task
R8                    7 678                  8                            8.19   Mono-label
R10                  12 902                 10                            9.13   Multi-label
Newsgroup            18 846                 20                           22.34   Mono-label
WebKB                 4 177                  4                           42.36   Mono-label

Table 1. Corpora statistics.

4.1 Evaluation Protocol 

Many classification systems are soft classification systems that compute a score
for each possible category-document pair. Our system is a hard classification
system that assigns a document to one or many categories, with a score of either 1
or 0. The evaluation measures used in the literature, such as the breakeven point,
are not suitable for hard classification models and cannot be used to evaluate
and compare our approach with other methods. We have therefore chosen to
use the micro-F1 and macro-F1 measures. These measures correspond to a
classical F1 score computed for each category and averaged over the categories.
The macro-F1 measure does not take into account the size of the categories,
whereas the micro-F1 average is weighted by the size of each category. We
averaged the different models' performances on various train/test splits that
were randomly generated from the original dataset. We used the same procedure
for evaluating both our approach and the baseline approaches in order to
compare the results properly. For each training size, the performance of the
models was averaged over 5 runs. The hyper-parameters of the SVM and the
hyper-parameters of the RL-based approach were manually tuned. What we
present here are the best results obtained over the various parameter choices
we tested. For the RL approach, each policy was learned on 10,000 randomly
generated states, with 1 rollout per state, using a random initial policy. It is
important to note that, in a practical sense, the RL method is not much more
complicated to tune than a classical SVM since it is rather robust regarding the
values of the hyper-parameters.
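
For readers unfamiliar with the two averages, the following Python sketch computes them from hard label assignments. It assumes the common definitions (macro-F1 as the unweighted mean of per-category F1 scores, micro-F1 computed from counts pooled over all categories), which may differ in detail from the evaluation code actually used; the function names and toy data are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    true: List[Set[str]], pred: List[Set[str]], categories: List[str]
) -> Tuple[float, float]:
    # Per-category counts: [true positives, false positives, false negatives].
    counts: Dict[str, List[int]] = {c: [0, 0, 0] for c in categories}
    for t, p in zip(true, pred):
        for c in categories:
            if c in p and c in t:
                counts[c][0] += 1
            elif c in p:
                counts[c][1] += 1
            elif c in t:
                counts[c][2] += 1
    # Macro: every category weighs the same, whatever its size.
    macro = sum(f1_from_counts(*counts[c]) for c in categories) / len(categories)
    # Micro: pool the counts first, so large categories dominate.
    tp, fp, fn = (sum(counts[c][i] for c in categories) for i in range(3))
    micro = f1_from_counts(tp, fp, fn)
    return micro, macro


true_labels = [{"a"}, {"a"}, {"a"}, {"b"}]
pred_labels = [{"a"}, {"a"}, {"a"}, {"a"}]
micro, macro = micro_macro_f1(true_labels, pred_labels, ["a", "b"])
```

In this toy case the small category "b" is never predicted, which pulls macro-F1 down to 3/7 while micro-F1 stays at 0.75.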

³ http://web.ist.utl.pt/%7Eacardoso/datasets/

⁴ http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/

⁵ http://people.csail.mit.edu/jrennie/20Newsgroups/

4.2 Experimental Results

Our performance figures use SVM to denote baseline Support Vector Machine
performance, and STC (Sequential Text Classification) to denote our approach.
In the case of the mono-label experiments (Figures 3 and 4, left), the performance
of the SVM method and of our method is comparable. It is important to note,
however, that in the case of small training sizes (1%, 5%), the STC approach
outperforms SVM by 1-10% depending on the corpus. For example, on the R8
dataset we can see that for both F1 scores, STC is better by about 5% with a
training size of 1%. This is also visible with the NewsGroup dataset, where STC
is better by 10% for both metrics using a 1% training set. This shows that STC
is particularly advantageous with small training sets.

The reading process' behaviour is explored in Figure 5. Here, Reading Size 
corresponds to the mean percentage of sentences read for each document⁶. We
can see that Reading Size decreases as the training size gets bigger for mono-label 
corpora. This is due to the fact that the smaller training sizes are harder to learn, 
and therefore the agent needs more information to properly label documents. In 
the right-hand side of Figure 5, we can see a histogram of number of documents 
grouped by Reading Size. We notice that although there is a mean Reading Size 
of 41%, most of the documents are barely read, with a few outliers that are read 
until the end. The agent is clearly capable of choosing to read more or less of 
the document depending on its content. 
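
As a small worked example of the Reading Size measure, the Python sketch below averages, over test documents, the fraction of sentences read. The function name `reading_size` and the three toy documents are hypothetical; the sketch assumes the measure is the plain mean of per-document ratios, as the description above suggests.

```python
from typing import Sequence


def reading_size(read_counts: Sequence[int], total_counts: Sequence[int]) -> float:
    """Mean, over documents, of (sentences read) / (sentences in the document)."""
    ratios = [l / n for l, n in zip(read_counts, total_counts)]
    return sum(ratios) / len(ratios)


# Two barely-read documents and one read to the end.
size = reading_size([2, 1, 10], [8, 10, 10])
```

Here the per-document ratios are 0.25, 0.1 and 1.0, giving a mean of 0.45 even though two of the three documents are mostly unread, which mirrors the skewed histogram discussed above.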

In the case of multi-label classification, results are quite different. First, we see
that for the R10 corpus, our model's performance is lower than the baseline on
large training sets. Moreover, the multi-label model reads all the sentences of the
document during the classification process. This behaviour seems normal because
when dealing with multi-label documents, one cannot be sure that the remaining
sentences will not contain relevant information pertaining to a particular topic.
We hypothesize that the lower performance is due to the much larger action
space in the multi-label problem, and the fact that we are learning a single model
for all classes instead of one independent model per class.

5 Conclusions 

We have presented a new model that learns to classify by sequentially reading 
the sentences of a document, and which labels this document as soon as it has 
collected enough information. This method shows some interesting properties 
on different datasets. Particularly in mono-label TC, the model automatically 
learns to read only a small part of the documents when the training set is large, 
and the whole documents when the training set is small. It is thus able to adapt 
its behaviour to the difficulty of the classification task, which results in obtaining 
faster systems for easier problems. The performances obtained are close to the 
performance of a baseline SVM model for large training sets, and better for small 
training sets. 



⁶ If $l_i$ is the number of sentences in document $i$ read during the classification
process, $n_i$ is the total number of sentences in this document, and $N$ is the
number of test documents, then the reading size value is
$\frac{1}{N} \sum_{i=1}^{N} \frac{l_i}{n_i}$.
Fig. 3. Performances over the R8 Corpus (left) and NewsGroup Corpus (right)

This work opens many new perspectives in the Text Classification domain.
Particularly, it is possible to imagine some additional MDP actions for the clas- 
sification agent allowing the agent to parse the document in a more complex 
manner. For example, this idea can be extended to learn to classify XML docu- 
ments reading only the relevant parts. 